1. INTRODUCTION

Heart disease (HD) is one of the world's most serious diseases, and it is difficult to identify, which
makes diagnosis a challenge for both physicians and patients. Early detection of HD helps improve patients'
health through preventative measures. Cardiovascular disease (CVD) refers to a group of diseases affecting
the heart and blood vessels. Among the risk factors reported in [1], the leading one accounts for 13% of CVD
deaths, while tobacco is responsible for 9%, diabetes for 6%, lack of exercise for 6%, and obesity for 5%.
CardioHelp is a technology that uses a deep learning algorithm called a convolutional neural network (CNN)
to estimate the probability of a patient having a cardiovascular illness [2].

Figure 1 displays the outer view of the human heart. The right atrium is where blood returns to the heart
from the rest of the body. This blood has delivered its oxygen to the body and now has to be replenished.
It fills the right atrium and then passes into the right ventricle. When the right ventricle is full, it
contracts and pumps the blood into the lungs to replenish its oxygen. The oxygenated blood travels from the
lungs to the left atrium, which pushes it into the left ventricle, and from there it is pumped out to the
body. The blood then flows back to the right atrium, and the cycle begins again. Early detection of heart
problems such as coronary artery disease, arrhythmias, heart infections or defects, and disease of the heart
muscle is therefore important for both physicians and patients.
Machine learning (ML) is one of the most effective ways of predicting whether or not a person is suffering
from a specific ailment in the field of medical analysis [3], [4]. The random forest classification
algorithm has been reported as the most accurate for predicting a person's heart illness [5]. Data-driven
approaches based on ML algorithms such as K-nearest neighbor (K-NN), decision tree (DT), logistic regression
(LR), and many more are viable alternatives [6]-[8]. The rest of this paper is organized as follows:
section 2 describes the methodology used to predict HD; section 3 defines the performance metrics;
section 4 presents and discusses the results of all models; and section 5 concludes.
Figure 1. Outer view of the human heart


2. RESEARCH METHOD 

The electronic health record (EHR) is one of the strategies for keeping track of a patient's whole
medical history and analyzing the data in the future. The complete process for comparative analysis and
prediction of HD follows some basic steps, as shown in Figure 2. In this work, the heart dataset is analyzed
with different machine learning models to predict HD in advance and so help avoid deaths.
Figure 2. Methodology for heart attack prediction


2.1. Data collection 

Data collection is the process of gathering, measuring, and evaluating accurate information for research
using well-defined techniques. In this work, the HD dataset has been collected from the UCI
repository. It contains 303 individual patient records and 14 attributes (13 input features and one output
attribute), as described in Table 1.
Table 1. Features and descriptions

No.  Feature          Description
1    Chest pain (cp)  0: typical angina, 1: atypical angina, 2: non-anginal pain, and 3: asymptomatic
2    restecg          0: nothing to note, 1: ST-T non-normal heartbeat, and 2: hypertrophy of the left ventricle
3    thal             1, 3: normal, 6: fixed defect, and 7: reversible defect
4    trestbps         Resting blood pressure (mm Hg); anything exceeding 130-140 is usually a cause for worry
5    fbs              Fasting blood sugar greater than 120 mg/dL (1 = true; 0 = false); above 126 mg/dL indicates diabetes
6    age              Age in years
7    sex              1 = male; 0 = female
8    chol             Serum cholesterol in mg/dL
9    thalach          Maximum heart rate attained
10   exang            Exercise-induced angina (1 = yes; 0 = no)
11   oldpeak          ST depression induced by exercise relative to rest; reflects heart stress during activity
12   slope            Slope of the peak exercise ST segment
13   ca               Number of major vessels (0-3) colored by fluoroscopy
14   output           Predicted HD or not (1 = yes; 0 = no); the target attribute


2.2. Data pre-processing 

Preprocessing the data is a major step before predicting any disease in a healthcare setting, because raw
records are often incomplete and unreliable and are therefore likely to be riddled with errors. The dataset
is cleaned by removing duplicate, null, or irrelevant values. To train the models, the HD dataset has been
split into a training set and a test set (i.e. 80 percent for training and 20 percent for testing).
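
The cleaning and 80/20 split described above can be sketched as follows. This is a minimal illustration in plain Python rather than the code used in this work: the function name `clean_and_split`, the fixed seed, and the toy rows are assumptions made for the example.

```python
import random


def clean_and_split(records, test_fraction=0.2, seed=42):
    """Drop rows with null values and exact duplicates, then split train/test."""
    seen = set()
    cleaned = []
    for row in records:
        if any(value is None for value in row):
            continue  # skip rows that contain a null value
        key = tuple(row)
        if key in seen:
            continue  # skip exact duplicates of an earlier row
        seen.add(key)
        cleaned.append(row)
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    rng.shuffle(cleaned)
    n_test = round(len(cleaned) * test_fraction)
    return cleaned[n_test:], cleaned[:n_test]


# Hypothetical toy rows (age, sex, output); not taken from the UCI dataset.
rows = [(63, 1, 1), (63, 1, 1), (41, 0, None), (56, 1, 0),
        (57, 0, 1), (44, 1, 1), (52, 1, 0)]
train_rows, test_rows = clean_and_split(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before the split avoids any ordering in the source file leaking into the test set.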


2.3. Apply machine learning and deep learning models 
2.3.1. Machine learning 

ML, the most common type of artificial intelligence (AI), analyses and discovers patterns in
massive data sets. This aids decision-making and builds a cost-effective model for correctly predicting
cardiac disease [9]. For each model below, the Precision-Recall Curve (PRC) is plotted; it can be used to
calculate the best classification threshold for the model.
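
One way to read a threshold off a PRC is sketched below. The text does not state which criterion is used, so picking the threshold with the highest F1 score is an assumption, as are the function names and the toy scores.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score as cut-off."""
    points = []
    for threshold in sorted(set(scores)):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 1)
        fp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 0)
        fn = sum(1 for p, t in zip(predicted, y_true) if p == 0 and t == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((threshold, precision, recall))
    return points


def best_threshold(y_true, scores):
    """Pick the threshold whose F1 score is highest (assumed criterion)."""
    def f1(point):
        _, p, r = point
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(precision_recall_points(y_true, scores), key=f1)[0]


# Hypothetical labels and predicted probabilities for six patients.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
chosen = best_threshold(y_true, scores)
print(chosen)
```

In a clinical setting a different criterion, for example a minimum recall, may be more appropriate than F1.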


a. Naive Bayes classifier 

Naive Bayes (NB) applies Bayes' theorem: it computes the probability of each class given the observed
feature values, treating the features as independent, and predicts the most probable class. An NB-based HD
prediction system can thus identify the disease efficiently from medical data [10], [11]. Figure 3 shows the
result of the NB model on test data, where Figure 3(a) shows the PRC plot and Figure 3(b) gives the confusion
matrix, which predicts a total of thirty-six HD patients.
Algorithm:
— Create a frequency table from the dataset, and a likelihood table by calculating the probabilities.
— Apply the NB equation to calculate the posterior probability for each class.
— Take the class with the highest posterior probability as the predicted outcome.
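
The three steps above can be sketched for categorical features as follows. This is an illustrative sketch, not the implementation used here; the add-one (Laplace) smoothing, the function names, and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Step 1: build the class frequency table and the likelihood counts."""
    class_counts = Counter(labels)
    feature_counts = Counter()          # (feature index, value, label) -> count
    feature_values = defaultdict(set)   # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[(index, value, label)] += 1
            feature_values[index].add(value)
    return class_counts, feature_counts, feature_values


def predict_naive_bayes(model, row):
    """Steps 2 and 3: score each class and return the most probable one."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    posteriors = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for index, value in enumerate(row):
            # Add-one smoothing (an assumption) avoids zero likelihoods.
            numerator = feature_counts[(index, value, label)] + 1
            denominator = count + len(feature_values[index])
            score *= numerator / denominator
        posteriors[label] = score  # unnormalised posterior
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical toy data: (exang, fbs) -> output.
rows = [(1, 0), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 0, 0, 1, 0]
model = train_naive_bayes(rows, labels)
label, posteriors = predict_naive_bayes(model, (1, 0))
print(label, posteriors)
```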


Figure 3. Heart disease prediction for NB model using (a) PRC plot and (b) confusion matrix


b. Decision tree classifier

A decision tree (DT) model is mostly used to address classification problems such as determining
HD, for example using NetBeans and Weka software [12]. DT with iterative feature elimination has been used
to produce accurate cardiac disease prediction at an early stage [13]. The PRC and confusion matrix (cm) for
the DT model are shown in Figure 4(a) and Figure 4(b).
Algorithm:
— Start from the root node and move forward to its branch nodes by finding the best attribute.
— Calculate the entropy (H) and information gain (IG) of each attribute, and select the attribute with the
lowest H and highest IG. Repeat the process on each resulting subset, using only attributes not selected
previously.
— Go to step 1 until the outcome is obtained.
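
The attribute-selection step can be illustrated with a short sketch of entropy and information gain. The function names and the toy data are assumptions made for the example; a full tree builder would call `best_attribute` recursively on each subset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """IG = H(parent) minus the size-weighted H of the subsets after the split."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, candidates):
    """Select the candidate attribute with the highest information gain."""
    return max(candidates, key=lambda i: information_gain(rows, labels, i))


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]
print(best_attribute(rows, labels, [0, 1]))
```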


Figure 4. Heart disease prediction for DT model using (a) PRC plot and (b) confusion matrix


c. Random forest

Random forest (RF) is a data analysis technique used to anticipate and detect CVDs such as stroke
and heart attack efficiently [14]. RF has been reported to predict cardiovascular disorders with 90.3 percent
accuracy [15]. The PRC and cm for the RF model, which predicts thirty-four HD patients, are shown in
Figures 5(a) and 5(b).
Algorithm:
— Randomly select k features (here k=13) from the total of m features in the heart dataset (here m=14),
where k < m, and build a DT on every sample.
— Take the test features, apply the rules of each DT to predict the outcome, and store the predicted result.
— Count the votes for every predicted result and choose the majority result as the final prediction.
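
The sampling and voting steps can be sketched as below. To stay short, each tree is reduced to a depth-one tree (a stump), which is a simplification and not how a full RF grows its trees; the function names, the values of `n_trees` and `k`, and the toy data are likewise assumptions.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature_indices):
    """Depth-one tree: keep the feature whose per-value majority rule errs least."""
    best = None
    for index in feature_indices:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[index], []).append(label)
        rule = {value: majority(group) for value, group in groups.items()}
        errors = sum(rule[row[index]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, index, rule)
    _, index, rule = best
    return index, rule, majority(labels)  # the last item is the fallback label


def train_forest(rows, labels, n_trees=15, k=2, seed=0):
    """Train each stump on a bootstrap sample and a random subset of k features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        features = rng.sample(range(n_features), k)
        forest.append(train_stump([rows[i] for i in sample],
                                  [labels[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Collect one vote per tree and return the majority vote."""
    votes = [rule.get(row[index], default) for index, rule, default in forest]
    return majority(votes)


# Hypothetical toy data with three binary features.
rows = [(1, 1, 1)] * 4 + [(0, 0, 0)] * 4
labels = [1] * 4 + [0] * 4
forest = train_forest(rows, labels)
print(predict_forest(forest, (1, 1, 1)))
```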


Figure 5. Heart disease prediction for RF model using (a) PRC plot and (b) confusion matrix


d. K-nearest neighbor classifier

The K-nearest neighbor (K-NN) classifier assumes that a new case is similar to previously seen cases
and assigns it to the existing category that it most resembles. When diagnosing the HD dataset, K-NN
has been reported to produce a more effective classification than NB and DT [16], [17]. The PRC and cm for
the K-NN classifier on test data, which predicts thirty-four HD patients, are shown in
Figure 6(a) and Figure 6(b).
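
The K-NN idea can be illustrated with a short sketch. Euclidean distance and k=3 are assumptions, as are the function name and the toy data; no feature scaling is applied here, although in practice features on different scales would usually be normalized first.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Assign the query to the majority class among its k nearest neighbours."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical toy data: (age, trestbps) -> output.
train_rows = [(63, 145), (67, 160), (41, 130), (37, 120), (62, 140)]
train_labels = [1, 1, 0, 0, 1]
print(knn_predict(train_rows, train_labels, (60, 150)))
print(knn_predict(train_rows, train_labels, (40, 125)))
```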


Figure 6. Heart disease prediction for K-NN model using (a) PRC plot and (b) confusion matrix


e. Logistic regression classifier 

Logistic regression (LR) calculates the probability of a categorical dependent variable.
Here the dependent variable is binary, with data coded as 1 or 0. When the accuracy of LR was
compared with other models, it was found suitable for HD prediction [18]. The PRC and cm for the LR
model on test data, which predicts thirty-seven HD patients, are plotted in Figure 7(a) and
Figure 7(b) respectively.
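
A minimal sketch of how LR turns features into a probability and a 0/1 label is given below. Training by plain stochastic gradient descent, the learning rate, the number of epochs, and the one-feature toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability that the dependent variable equals 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(rows, labels, learning_rate=0.1, epochs=1000):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, y in zip(rows, labels):
            error = predict_proba(weights, bias, features) - y
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict_label(weights, bias, features, threshold=0.5):
    """Code the outcome as 1 when the probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical one-feature toy data.
rows = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = [0, 0, 1, 1]
weights, bias = train(rows, labels)
print(predict_label(weights, bias, (0.0,)), predict_label(weights, bias, (3.0,)))
```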


Figure 7. Heart disease prediction for LR model using (a) PRC plot and (b) confusion matrix


f. Support vector machine 

The support vector machine (SVM) algorithm separates data points with a decision boundary called a
hyperplane. The hyperplane is positioned using the data points closest to it (the support vectors) and
classifies the data into two separate classes. HD prediction using a particle swarm optimization (PSO)-based
SVM algorithm is reported in [19] to outperform DT, NB, NN, and SVM by a factor of 100. The PRC and cm plot
for the SVM model, which predicts thirty-five HD patients, are shown in Figure 8(a) and Figure 8(b).
Algorithm:
— Create hyperplanes to separate the dataset into two classes and identify the right hyperplane for SVM.
— Check whether this hyperplane properly separates the dataset. If not, apply the kernel trick.
— Check whether the data points are now linearly separable.
— Find the margin, i.e. the distance from the hyperplane to the closest data points on both sides, and
choose the hyperplane for which it is at its maximum. The model is then ready for prediction.
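
The margin step can be illustrated by comparing two candidate hyperplanes on the same data. This sketch only evaluates given hyperplanes; it does not solve the SVM optimization problem or apply a kernel. Labels are coded as -1 and +1 here (an assumption; the dataset uses 0 and 1), and the candidate hyperplanes and toy points are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, rows, labels):
    """Distance from the hyperplane to the closest point.

    The result is negative if any point lies on the wrong side.
    """
    return min(y * signed_distance(w, b, x) for x, y in zip(rows, labels))


# Hypothetical linearly separable toy data with labels in {-1, +1}.
rows = [(1, 1), (2, 1), (4, 4), (5, 4)]
labels = [-1, -1, 1, 1]

margin_a = margin((1, 0), -3.0, rows, labels)   # vertical line x = 3
margin_b = margin((1, 1), -5.5, rows, labels)   # diagonal line x + y = 5.5
print(margin_a, margin_b)
```

Of the two candidates, the diagonal hyperplane leaves the wider margin and would be preferred.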


Figure 8. Heart disease prediction for SVM model using (a) PRC plot and (b) confusion matrix


g. Light gradient boosting machine (LGBM) 

Light gradient boosting machine (LGBM) follows a best-first search to find the optimal path. This model
grows trees vertically, i.e. leaf-wise, whereas other decision-tree algorithms grow trees horizontally, i.e.
level-wise. The PRC and cm plot for the LGBM classifier on test data, which predicts a total of thirty-six
HD patients, are shown in Figure 9(a) and Figure 9(b) respectively.

Algorithm:
— Set the hyperparameters and the number of DTs. Trees are then added in sequence to improve prediction.
— Assign the number of trees (here n_estimators is 20) and measure the model's performance.
— Explore the tree depth (here max_depth is 5), the number of terminal nodes, and the learning rate.


Figure 9. Heart disease prediction for LGBM model using (a) PRC plot and (b) confusion matrix


2.3.2. Deep learning 

Deep learning (DL) is an AI mechanism that mimics the network of neurons in a brain. An
advanced DL model can build a knowledge-rich environment that assists clinical decision-making for
predicting CVD patients [20].


a. Neural network

The neural network (NN) algorithm is based on biological neural networks and aims to mimic the
human nervous system in the learning process. Results reported for HD prediction tasks show that a
three-layer NN model using a backpropagation algorithm and an Adam optimizer reached a promising accuracy
[21]-[24]. The PRC and cm plot for the NN model, which predicts thirty-three HD patients, are shown in
Figure 10(a) and Figure 10(b).
Algorithm:
— Starting from the input layer, each neuron applies its weights and bias to the inputs (i.e. weight x
inputs + bias).
— Move forward to the hidden layer and apply an activation function to every output from it.
— The outcome is produced by a single output layer (i.e. the softmax layer).
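
The forward pass described in these steps can be sketched without any DL library. The ReLU activation, the layer sizes, and the weight values below are assumptions chosen for illustration; they are not the trained network from this work, and training by backpropagation is not shown.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum plus bias for each neuron; weights[j] belongs to neuron j."""
    return [sum(w * x for w, x in zip(neuron_w, inputs)) + b
            for neuron_w, b in zip(weights, biases)]


def relu(values):
    """Activation function applied to the hidden layer (assumed to be ReLU)."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn output-layer scores into class probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer with activation -> softmax output layer."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical weights for a network with 2 inputs, 2 hidden units, 2 classes.
probabilities = forward(
    [1.0, 2.0],
    hidden_w=[[0.5, -0.5], [1.0, 1.0]], hidden_b=[0.0, -1.0],
    out_w=[[1.0, -1.0], [-1.0, 1.0]], out_b=[0.0, 0.0],
)
print(probabilities)
```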


Figure 10. Heart disease prediction for NN model using (a) PRC plot and (b) confusion matrix


b. Artificial neural network 

Artificial neural networks (ANNs) are a versatile and effective tool that can assist doctors in
analyzing and modeling complex clinical data in a variety of medical settings. The system proposed in [25]
creates an effective methodology for collecting clinical and electrocardiogram (ECG) data to train an ANN to
detect heart problems. A smart health care framework for identifying CVD generates a predefined number of
NNs after training and evaluating the framework [26], [27]. Figure 11 shows the result of the ANN model on
test data, where Figure 11(a) shows the PRC plot and Figure 11(b) gives the confusion matrix, which predicts
a total of thirty-five HD patients.
Figure 11. Heart disease prediction for ANN model using (a) PRC plot and (b) confusion matrix


3. PERFORMANCE METRICS 

The performance of a model can be measured using performance metrics. To measure the
performance of all nine models, we have taken four performance metrics: precision, recall, accuracy, and F1
score, as shown in (1) to (4). Precision (P) determines the exactness of a classifier, i.e. the share of
predicted positives that are correct, as shown in (1). Recall (R) determines the classifier's completeness,
i.e. the share of actual positives that are found, as shown in (2). When R improves, P often suffers as a
result. Accuracy is the share of all predictions that are correct and is used for identifying the best model
for finding relationships and patterns between variables, as shown in (3). The F1 score is the harmonic mean
of precision and recall and is measured using (4).


3.1. Precision (P)

P = TP / (TP+FP) (1)


3.2. Recall (R)

R = TP / (TP+FN) (2)


3.3. Accuracy

Accuracy = (TP+TN) / (TP+TN+FP+FN) (3)


3.4. F1 Score

F1 Score = 2 * ((P * R) / (P + R)) (4)

where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative
respectively.
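
Equations (1) to (4) can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical and are not results from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy, and F1 score from cm counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0                 # equation (1)
    recall = tp / (tp + fn) if tp + fn else 0.0                    # equation (2)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                     # equation (3)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                          # equation (4)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts for a 20-patient test set.
metrics = classification_metrics(tp=8, tn=6, fp=2, fn=4)
print(metrics)
```

The guards against a zero denominator return 0.0, a convention chosen for this sketch.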


4. RESULT AND DISCUSSION 

HD prediction at an earlier stage may help to avoid deaths from heart attacks. A good classification
system can assist a doctor in predicting the existence of CVD before it happens. The HD dataset has been
collected from the UCI repository and consists of 303 records, which undergo pre-processing to remove any
null values present in the dataset. For the ML side of this project, the sklearn module is used and, for
the DL side, Keras layers. To evaluate all nine models, performance is then measured using a cm and a ROC
curve, as shown in Figure 12. Figure 12(a) represents the cm and Figure 12(b) shows the ROC curve, which
plots the true positive rate (TPR) against the false positive rate (FPR) for all nine models. This plot
demonstrates that the LR model (shown in orange) is the best predictive classifier, with an accuracy of
90.7%, in detecting heart attack.
Figure 12. Comparative models prediction for HD using (a) performance metric and (b) ROC Curve
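
The TPR and FPR pairs that make up a ROC curve can be computed as in the following sketch. The function names, the trapezoidal area calculation, and the toy scores are assumptions for illustration and do not reproduce the curves of Figure 12.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, y_true) if s >= threshold and t == 1)
        fp = sum(1 for s, t in zip(scores, y_true) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels and predicted probabilities for four patients.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points, area_under_curve(points))
```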


5. CONCLUSION 

HD is the leading cause of death in the country. Manually calculating the chances of developing
HD based on risk factors such as age and sex is difficult. To predict the outcome from data, ML and DL
techniques are applied, and they can give a significant result. They can assist professionals in
identifying, diagnosing, and treating cardiovascular disease. In this work, a total of nine classifiers are
tested and measured using a cm for predicting a heart attack. Moreover, a PRC plot has been produced for
each model to examine its performance on the test set. The results demonstrate that the LR model
outperforms the other models with 90.7 percent accuracy, which indicates that this model has a greater
ability to identify HD. The LR model may therefore predict more effectively on real-time HD datasets.
Advances in both ML and DL will continue to revolutionize healthcare in the future.